require(concurve)
## Loading required package: concurve
require(ggplot2)
## Loading required package: ggplot2
We assume here that there are two variables, each with a total of 100 observations, and that there is no relationship whatsoever between them, so that the true regression coefficient is 0. We run 10,000 simulations of this test and compute 10,000 P-values to see their overall distribution.
set.seed(1031)
n.sim <- 10000
t.sim <- numeric(n.sim)
n.samp <- 100
for (i in 1:n.sim) {
  X <- rnorm(n.samp, mean = 0, sd = 1)
  Y <- rnorm(n.samp, mean = 0, sd = 1)
  df <- data.frame(X, Y)
  fit <- glm(Y ~ X, data = df)
  t.sim[i] <- coef(summary(fit))[2, 4] # P-value for the slope coefficient
}
ggplot(data = NULL, aes(x = t.sim)) +
  geom_histogram(bins = 30, col = "black", fill = "#3f8f9b", alpha = 0.75) +
  labs(
    title = "Distribution of P-values Under the Null Hypothesis",
    x = "P-value"
  ) +
  theme_bw()
pvalue <- c(
  0.99, 0.90, 0.50, 0.25, 0.125, 0.10,
  0.05, 0.025, 0.01, 0.005, 0.0001,
  0.0000003, 0.000000001
)
svalue <- round((-log2(pvalue)), 2)
table1 <- data.frame(pvalue, svalue)
colnames(table1) <- c("P-value", "S-value")
knitr::kable(table1)
| P-value | S-value |
|---|---|
| 9.90e-01 | 0.01 |
| 9.00e-01 | 0.15 |
| 5.00e-01 | 1.00 |
| 2.50e-01 | 2.00 |
| 1.25e-01 | 3.00 |
| 1.00e-01 | 3.32 |
| 5.00e-02 | 4.32 |
| 2.50e-02 | 5.32 |
| 1.00e-02 | 6.64 |
| 5.00e-03 | 7.64 |
| 1.00e-04 | 13.29 |
| 3.00e-07 | 21.67 |
| 1.00e-09 | 29.90 |
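As a quick aside on interpretation (a base-R sketch, not part of the original table): an S-value of \(s\) bits conveys the same surprisal as observing \(s\) consecutive heads from a fair coin.

```r
# Probability of 4 heads in a row from a fair coin, expressed in bits
-log2(0.5^4) # exactly 4 bits
# P = 0.05 carries slightly more information than 4 consecutive heads
-log2(0.05) # ~4.32 bits
```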
Valid P-values are uniform under the null hypothesis and their corresponding S-values are exponentially distributed. We run the same simulation as before, but then convert the obtained P-values into S-values.
set.seed(1031)
n.sim <- 10000
t.sim <- numeric(n.sim)
n.samp <- 100
for (i in 1:n.sim) {
  X <- rnorm(n.samp, mean = 0, sd = 1)
  Y <- rnorm(n.samp, mean = 0, sd = 1)
  df <- data.frame(X, Y)
  fit <- glm(Y ~ X, data = df)
  t.sim[i] <- coef(summary(fit))[2, 4] # P-value for the slope coefficient
}
ggplot(data = NULL, aes(x = -log2(t.sim))) +
  geom_histogram(bins = 30, col = "black", fill = "#d46c5b", alpha = 0.75) +
  labs(
    title = "Distribution of S-values Under the Null Hypothesis",
    x = "S-value (Bits of Information)"
  ) +
  theme_bw()
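The uniform-to-exponential relationship can also be checked without fitting any models; this is a quick base-R sketch assuming only that valid P-values are Uniform(0, 1) under the null:

```r
set.seed(1031)
p <- runif(100000) # valid P-values under the null are Uniform(0, 1)
s <- -log2(p) # corresponding S-values

# If S = -log2(U) with U ~ Uniform(0, 1), then S ~ Exponential(rate = log(2)),
# so the mean is 1 / log(2) (about 1.44 bits) and the median is exactly 1 bit.
mean(s)
median(s)
```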
Despite the name, posterior predictive p-values are not actual P-values because they do not meet this uniformity criterion. Here we fit a simple Bayesian regression model with a very weakly informative prior (normal(0, 10)), where both the predictor and the response variable come from the same data-generating mechanism and have the same location and scale parameters. We then calculate the posterior predictive p-value for this model and plot it using the Stan plotting functions. Then, we simulate this process 1,000 times to examine the distribution of posterior predictive p-values and compare it to the distribution of standard P-values.
library(rstanarm)
library(bayesplot)
This was the posterior predictive p-value for one model. Now let’s simulate this process and generate the distribution.
ppp <- rep(NA, 1000)
for (i in 1:length(ppp)) {
  y <- rnorm(100, mean = 0, sd = 1)
  x <- rnorm(100, mean = 0, sd = 1)
  data <- data.frame(x, y)
  mod1 <- stan_glm(y ~ x, data = data, prior = normal(0, 10), verbose = FALSE, refresh = 0)
  yrep <- posterior_predict(mod1)
  # Posterior predictive p-value for the mean statistic: Pr(T(yrep) >= T(y))
  ppp[i] <- mean(rowMeans(yrep) >= mean(y))
}
ggplot(data = NULL, aes(x = ppp)) +
  geom_histogram(bins = 30, col = "black", fill = "#3f8f9b", alpha = 0.75) +
  labs(
    title = "Distribution of Posterior Predictive P-values",
    x = "Posterior Predictive P-value"
  ) +
  theme_bw()
As we can see, their distribution hardly resembles the uniform distribution of P-values under the null hypothesis of zero effect.
Further, the negative base-2 log transformation of these values is not exponentially distributed.
ggplot(data = NULL, aes(x = (-log2(ppp)))) +
  geom_histogram(bins = 30, col = "black", fill = "#d46c5b", alpha = 0.75) +
  labs(
    title = "Distribution of -log2(Posterior Predictive P-values)",
    x = "-log2(Posterior Predictive P-value)"
  ) +
  theme_bw()
Here we simulate a study in which one group of 100 participants has a mean of 100 with a standard deviation of 20, and a second group of 100 participants has a mean of 80 with the same standard deviation. We compare the groups using a t-test, generate 95% interval estimates 100 times, and plot them. Since we know the true mean difference is 20, we wish to see how often the interval estimates cover this parameter value.
sim <- function() {
  x <- rnorm(100, 100, 20)
  y <- rnorm(100, 80, 20)
  t.test(x = x, y = y, conf.level = 0.95)$conf.int
}
set.seed(1031)
z <- replicate(100, sim(), simplify = FALSE)
df <- data.frame(do.call(rbind, z))
df$studynumber <- (1:length(z))
intrvl.limit <- c("lower.limit", "upper.limit", "studynumber")
colnames(df) <- intrvl.limit
df$point <- (df$lower.limit + df$upper.limit) / 2
df$covered <- (df$lower.limit <= 20 & 20 <= df$upper.limit)
df$coverageprob <- 100 * mean(df$covered) # robust even when every interval covers
library(ggplot2)
ggplot(data = df, aes(x = studynumber, y = point, ymin = lower.limit, ymax = upper.limit)) +
  geom_pointrange(mapping = aes(color = covered), size = .40) +
  geom_hline(yintercept = 20, lty = 1, color = "red", alpha = 0.5) +
  coord_flip() +
  labs(
    title = "Simulated 95% Intervals",
    x = "Study Number",
    y = "Estimate",
    subtitle = "Population Parameter is 20"
  ) +
  theme_bw() + # use a white background
  theme(legend.position = "none") +
  annotate(
    geom = "text", x = 102, y = 30,
    label = "Coverage (%) =", size = 2.5, color = "black"
  ) +
  annotate(
    geom = "text", x = 102, y = 35,
    label = df$coverageprob[1], size = 2.5, color = "black"
  )
The following statistics are taken from Brown et al. (DOI: 10.1001/jama.2017.3415). We use the reported estimates to run statistical tests of several different test hypotheses and then use those results to construct various functions.
We calculate the standard error from the confidence (compatibility) limits and verify that, combined with the point estimate, it reproduces those limits.
se <- log(2.59 / 0.997) / 3.92
logUL <- log(2.59)
logLL <- log(0.997)
logpoint <- log(1.61)
logpoint + (1.96 * se)
## [1] 0.9535654
logpoint - (1.96 * se)
## [1] -0.001097013
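The same back-calculation can be wrapped in a small helper function. This is a sketch assuming a symmetric Wald interval on the log scale; `se_from_ci` is a hypothetical name, not a concurve function.

```r
# Recover the log-scale standard error from a ratio-scale Wald interval
se_from_ci <- function(LL, UL, level = 0.95) {
  z <- qnorm(1 - (1 - level) / 2) # ~1.96 for a 95% interval
  log(UL / LL) / (2 * z)
}

se_from_ci(0.997, 2.59) # ~0.2435, agreeing with log(2.59 / 0.997) / 3.92
```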
testhypothesis <- c(
  "Halving of hazard", "No effect (null)", "Point estimate",
  "Doubling of hazard", "Tripling of hazard", "Quintupling of hazard"
)
hazardratios <- c(0.5, 1, 1.61, 2, 3, 5)
pvals <- c(1.6e-06, 0.05, 1.00, 0.37, 0.01, 3.2e-06)
svals <- round(-log2(pvals), 3)
lr <- round(exp(((log(hazardratios / 1.61) / se)^2) / 2), 3)
LR <- formatC(lr, format = "e", digits = 2)
table2 <- data.frame(
  testhypothesis, hazardratios,
  pvals, svals, LR
)
colnames(table2) <- c(
  "Test Hypothesis", "Hazard Ratios",
  "P-values", "S-values", "Likelihood Ratio Statistics"
)
knitr::kable(table2)
knitr::kable(table2)
| Test Hypothesis | Hazard Ratios | P-values | S-values | Likelihood Ratio Statistics |
|---|---|---|---|---|
| Halving of hazard | 0.50 | 1.6e-06 | 19.253 | 1.02e+05 |
| No effect (null) | 1.00 | 5.0e-02 | 4.322 | 6.77e+00 |
| Point estimate | 1.61 | 1.0e+00 | 0.000 | 1.00e+00 |
| Doubling of hazard | 2.00 | 3.7e-01 | 1.434 | 1.49e+00 |
| Tripling of hazard | 3.00 | 1.0e-02 | 6.644 | 2.62e+01 |
| Quintupling of hazard | 5.00 | 3.2e-06 | 18.253 | 5.03e+04 |
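The hard-coded `pvals` above can be recovered from the normal approximation on the log scale; a sketch using the same standard error, which matches the tabulated values up to rounding:

```r
se <- log(2.59 / 0.997) / 3.92 # log-scale standard error from the limits
hr <- c(0.5, 1, 1.61, 2, 3, 5) # test hypotheses from the table
z <- log(hr / 1.61) / se # Wald z-statistic for each test hypothesis
p <- 2 * pnorm(-abs(z)) # two-sided P-values
signif(p, 2)
```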
label <- "Brown et al."
point <- 1.61
lower <- 0.997
upper <- 2.59
df <- data.frame(
  label, point,
  lower, upper
)
Here we plot the 95% interval estimate reported from the high-dimensional propensity score analysis.
library(ggplot2)
ggplot(data = df, mapping = aes(x = label, y = point, ymin = lower, ymax = upper)) +
  geom_pointrange(color = "black", size = 1.1) +
  geom_hline(yintercept = 1, lty = 1, color = "dark red") +
  coord_flip() +
  scale_y_log10(breaks = scales::pretty_breaks(n = 10)) +
  labs(
    title = "Association Between Serotonergic Antidepressant Exposure \nDuring Pregnancy and Child Autism Spectrum Disorder",
    subtitle = "95% Compatibility Interval",
    caption = "Figure 2",
    x = "Study",
    y = "Hazard Ratio"
  ) +
  theme_minimal()
To use this information to construct a P-value function, we load the concurve R package.
library(concurve)
We enter the point estimate and its compatibility limits to produce every possible interval, along with the corresponding P-values and S-values. The calculation assumes normality.
curve1 <- curve_rev(point = 1.61, LL = 0.997, UL = 2.59, measure = "ratio", steps = 10000)
## [1] 0.2435408
The results are stored in the object “curve1”.
Plot the P-value (Compatibility) function of the Brown et al. data with ggplot2 graphics.
ggcurve(curve1[[1]], type = "c", measure = "ratio", nullvalue = TRUE) +
  labs(
    title = "P-Value (Compatibility) as a Function of the Hazard Ratio",
    subtitle = "Association Between Serotonergic Antidepressant Exposure \nDuring Pregnancy and Child Autism Spectrum Disorder",
    x = "Hazard Ratio (HR)",
    y = "p-value\n(Compatibility)"
  ) +
  geom_vline(xintercept = 1.61, lty = 1, color = "gray", alpha = 0.2) +
  geom_vline(xintercept = 2.59, lty = 1, color = "gray", alpha = 0.2) +
  theme_light()
We can also see how consistent these results are with previous studies conducted by the same research group, given the overlap of the functions.
curve1 <- curve_rev(point = 1.61, LL = 0.997, UL = 2.59, measure = "ratio", steps = 10000)
## [1] 0.2435408
curve2 <- curve_rev(
  point = 1.7, LL = 1.1, UL = 2.6,
  type = "c", measure = "ratio", steps = 10000
)
## [1] 0.2194431
lik2 <- curve_rev(
  point = 1.7, LL = 1.1, UL = 2.6,
  type = "l", measure = "ratio", steps = 10000
)
## [1] 0.2194431
lik1 <- curve_rev(
  point = 1.61, LL = 0.997, UL = 2.59,
  type = "l", measure = "ratio", steps = 10000
)
## [1] 0.2435408
Let’s compare the relative likelihood functions from both studies from this research group to see how consistent the results are.
(plot_compare(
  data1 = lik1[[1]], data2 = lik2[[1]],
  type = "l1", measure = "ratio", nullvalue = TRUE,
  title = "Brown et al. 2017. J Clin Psychiatry. vs. \nBrown et al. 2017. JAMA.",
  subtitle = "J Clin Psychiatry: OR = 1.7, 1/6.83 LI: LL = 1.1, UL = 2.6 \nJAMA: HR = 1.61, 1/6.83 LI: LL = 0.997, UL = 2.59",
  xaxis = expression(Theta ~ "= Hazard Ratio / Odds Ratio")
))
and the p-value functions.
(plot_compare(
  data1 = curve1[[1]], data2 = curve2[[1]],
  type = "c", measure = "ratio", nullvalue = TRUE,
  title = "Brown et al. 2017. J Clin Psychiatry. vs. \nBrown et al. 2017. JAMA.",
  subtitle = "J Clin Psychiatry: OR = 1.7, 1/6.83 LI: LL = 1.1, UL = 2.6 \nJAMA: HR = 1.61, 1/6.83 LI: LL = 0.997, UL = 2.59",
  xaxis = expression(Theta ~ "= Hazard Ratio / Odds Ratio")
))
Plot the S-value (Surprisal) function of the Brown et al. data with ggplot2 graphics.
ggcurve(
  data = curve1[[1]], type = "s", measure = "ratio", nullvalue = TRUE,
  title = "S-Value (Surprisal) as a Function of the Hazard Ratio",
  subtitle = "Association Between Serotonergic Antidepressant Exposure \nDuring Pregnancy and Child Autism Spectrum Disorder",
  xaxis = "Hazard Ratio", yaxis1 = "S-value (bits of information)"
)
Calculate and Plot Likelihood (Support) Intervals.
hrvalues <- seq(from = 0.65, to = 3.98, by = 0.01)
se <- log(2.59 / 0.997) / 3.92
zscore <- log(hrvalues / 1.61) / se
support <- exp((-zscore^2) / 2)
likfunction <- data.frame(hrvalues, zscore, support)
ggplot(data = likfunction, mapping = aes(x = hrvalues, y = support)) +
  geom_line() +
  geom_ribbon(aes(x = hrvalues, ymin = min(support), ymax = support),
    fill = "#239a98", alpha = 0.30
  ) +
  labs(
    x = "Hazard Ratio (HR)",
    y = "Relative Likelihood 1/MLR"
  ) +
  theme_light() +
  theme(
    axis.title.x = element_text(size = 13),
    axis.title.y = element_text(size = 13)
  ) +
  scale_x_log10(breaks = scales::pretty_breaks(n = 10)) +
  scale_y_continuous(breaks = scales::pretty_breaks(n = 10)) +
  theme(text = element_text(size = 15)) +
  theme(
    plot.title = element_text(size = 16),
    plot.subtitle = element_text(size = 12),
    plot.caption = element_text(size = 8)
  )
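As a check (base R, assuming normality), the 1/6.83 likelihood interval quoted in the comparison plots can be computed directly from this function; it coincides, to rounding, with the 95% compatibility interval, since \(\sqrt{2\ln(6.83)} \approx 1.96\).

```r
se <- log(2.59 / 0.997) / 3.92
# A 1/6.83 support interval keeps every HR with relative likelihood >= 1/6.83,
# i.e. exp(-z^2 / 2) >= 1/6.83, which solves to |z| <= sqrt(2 * log(6.83))
z.cut <- sqrt(2 * log(6.83)) # ~1.96, the familiar 95% Wald multiplier
exp(log(1.61) + c(-1, 1) * z.cut * se) # ~(0.997, 2.59)
```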
The upward-concave parabola \(Z^{2}/2 = \ln(MLR)\), the negative log of the relative likelihood above, is the likelihood analog of the S-value function.
support <- (zscore^2) / 2
likfunction <- data.frame(hrvalues, zscore, support)
ggplot(data = likfunction, mapping = aes(x = hrvalues, y = support)) +
  geom_line() +
  geom_ribbon(aes(x = hrvalues, ymin = support, ymax = max(support)),
    fill = "#239a98", alpha = 0.30
  ) +
  labs(
    x = "Hazard Ratio (HR)",
    y = "ln(MLR)"
  ) +
  theme_light() +
  theme(
    axis.title.x = element_text(size = 13),
    axis.title.y = element_text(size = 13)
  ) +
  scale_x_log10(breaks = scales::pretty_breaks(n = 10)) +
  scale_y_continuous(breaks = scales::pretty_breaks(n = 10)) +
  theme(text = element_text(size = 15)) +
  theme(
    plot.title = element_text(size = 16),
    plot.subtitle = element_text(size = 12),
    plot.caption = element_text(size = 8)
  )
Doubling this quantity gives the deviance statistic \(Z^{2} = 2\ln(MLR)\), plotted on the same scale.
support <- zscore^2
likfunction <- data.frame(hrvalues, zscore, support)
ggplot(data = likfunction, mapping = aes(x = hrvalues, y = support)) +
  geom_line() +
  geom_ribbon(aes(x = hrvalues, ymin = support, ymax = max(support)),
    fill = "#239a98", alpha = 0.30
  ) +
  labs(
    x = "Hazard Ratio (HR)",
    y = "Deviance Statistic 2ln(MLR)"
  ) +
  theme_light() +
  theme(
    axis.title.x = element_text(size = 13),
    axis.title.y = element_text(size = 13)
  ) +
  scale_x_log10(breaks = scales::pretty_breaks(n = 10)) +
  scale_y_continuous(breaks = scales::pretty_breaks(n = 10)) +
  theme(text = element_text(size = 15)) +
  theme(
    plot.title = element_text(size = 16),
    plot.subtitle = element_text(size = 12),
    plot.caption = element_text(size = 8)
  )